© Scott Robison 2021 all rights reserved.


Linear Models and Estimation by Least Squares

For Chapter 11: Linear Models and Estimation by Least Squares, we will cover the same material as the textbook, but likely not in the same order. I will list the page range for the whole of Chapter 11 but will go at my own pace and order, so please reference the textbook at your own convenience.

Chapter 11 pages 563-609 from the text.

Simple Linear Regression: Modelling Linear Relationship

After discovering that two variables in a bivariate data set covary, you will likely want to know how to describe/express the relationship. A successful description can then be used to model expected values of one variable, deemed the response variable \(Y\), based only on the other, the predictor variable \(X\).

The steps are:

  1. Collect bivariate X and Y variables from historic events. This data set will serve as “training” to understand the linear relationship that exists between the variables.

  2. Develop a mathematical expression/equation to transform a particular/hypothetical \(X\) into an expected/estimated \(Y\).

Consider “bivariate” data expressing temperature in \(^{\circ} C\), degrees Celsius (call this the \(X\) variable), and in \(^{\circ} F\), degrees Fahrenheit (call this the \(Y\) variable).
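Since degrees Fahrenheit is an exact linear function of degrees Celsius (\(F = 1.8\,C + 32\)), we can generate such bivariate data ourselves; a quick R sketch (the particular temperatures chosen are arbitrary):

```r
# Temperatures in degrees Celsius: the X variable
C <- seq(-10, 40, by = 5)
# Degrees Fahrenheit: the Y variable, an exact linear function of X
F_deg <- 1.8 * C + 32

plot(C, F_deg, xlab = "Temperature (C)", ylab = "Temperature (F)")
abline(a = 32, b = 1.8)  # every point falls exactly on this line
```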

What do you notice about the scatterplot?

Do you see how all the points fall exactly on the line? This is called a deterministic model; since every point falls exactly on the line, we can perfectly predict \(Y\) for any \(X\), even where no observed points exist.

The linear equation will follow the deterministic model’s form:

\[\begin{align} Y_i=\beta_0+\beta_1 X_i,&& i=1,2,…,n\\ \end{align}\]

where \(\beta_0\) is the y-intercept (the \(Y\) value when \(X=0\)) and \(\beta_1\) is the slope (rate of change in \(Y\) with respect to \(X\)).

In a deterministic model, only two sample points are required to find the model:

\[\begin{align} \beta_1=\frac{rise}{run}=\frac{y_2-y_1}{x_2-x_1},&& \text{ then }&\beta_0=y_1-\beta_1 x_1\\ \end{align}\]
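Continuing the temperature example, two points are enough to recover the entire deterministic model; a minimal R check using the freezing and boiling points of water:

```r
# Two known (Celsius, Fahrenheit) points
x1 <- 0;   y1 <- 32    # water freezes
x2 <- 100; y2 <- 212   # water boils

beta1 <- (y2 - y1) / (x2 - x1)  # slope = rise/run
beta0 <- y1 - beta1 * x1        # y-intercept

beta1  # 1.8
beta0  # 32
```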

Of course, in the “real” world we often lack the ability to measure variables with deterministic precision. We expect response and/or measurement bias in our observations. Additionally, when dealing with random variables, we know that our observations may lack consistency even when there is no bias at all!

Let’s return to our example of ten students’ heights and weights, and try to select two points from the data set then create linear models…

So which model of a non-deterministic data set is best?… The probabilistic model appears to be the same as the deterministic model; however, it includes an additional \(\varepsilon_i\) term:

\[\begin{align} Y_i=\beta_0+\beta_1 X_i+\varepsilon_i,&& i=1,2,…,n,&& \text{ where }\varepsilon_i\sim Norm(0,\sigma^2) \end{align}\]
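To see what the probabilistic model produces, here is a short R simulation; the coefficients and \(\sigma\) below are made-up values, purely for illustration:

```r
set.seed(1)

beta0 <- 10; beta1 <- 2; sigma <- 3            # hypothetical "true" parameters
X <- 1:30
eps <- rnorm(length(X), mean = 0, sd = sigma)  # epsilon_i ~ Norm(0, sigma^2)
Y <- beta0 + beta1 * X + eps

plot(X, Y)
abline(a = beta0, b = beta1)  # points scatter around, not on, the line
```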

The probabilistic model can also be written this way, to “hide” the \(\varepsilon_i\) term by admitting the following values are estimates:

\[\begin{align} \widehat{Y}_i=\widehat{\beta}_0+\widehat{\beta}_1 X_i,&& i=1,2,…,n\\ \end{align}\]

\(\widehat{Y}=\widehat{\beta}_0+\widehat{\beta}_1 X\); The Least-Squares Estimate Model

Let \(\widehat{Y}\) be the probabilistic model that is the closest “overall” to the sample points, meaning the fitted points \((X_i,\widehat{Y}_i)\) are, collectively, as close as possible to the observed sample points \((X_i,Y_i)\). Then we define the difference between \(Y_i\) and \(\widehat{Y}_i\) to be \(\varepsilon_i\).

Then \(\varepsilon_i=Y_i-\widehat{Y}_i\) (the residuals/errors/residual errors). We want the model with the least overall error, but requiring \(\sum_{i=1}^n \varepsilon_i =\sum_{i=1}^n(Y_i-\widehat{Y}_i)=0\) is not enough on its own.

One complication that comes up when setting the sum of errors equal to zero is that some of the errors are negative: positive and negative errors can cancel each other out, so many different lines satisfy this condition.

To overcome this we square the individual error terms and then discuss the SUM of SQUARED ERRORS,

\[SSE=\sum_{i=1}^n\varepsilon_i^2 =\sum_{i=1}^n(Y_i-\widehat{Y}_i)^2 =\sum_{i=1}^n\left(Y_i- (\widehat{\beta}_0+\widehat{\beta}_1 X_i)\right)^2\]
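SSE can be computed directly in R for any candidate line. Here it is evaluated on the ten-student height/weight data used in Example 1 below, once at the least-squares coefficients reported there and once at a deliberately worse slope:

```r
X <- c(63,64,66,69,69,71,71,72,73,75)            # heights
Y <- c(127,121,142,157,162,156,169,165,181,208)  # weights

# Sum of squared errors for a candidate intercept b0 and slope b1
sse <- function(b0, b1) sum((Y - (b0 + b1 * X))^2)

sse(-266.534, 6.138)  # SSE near its minimum, at the least-squares estimates
sse(-266.534, 7)      # any other slope gives a larger SSE
```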

We wish to estimate the probabilistic model in such a way that the sum of the squares of these vertical distances is as small as possible, a method called least-squares estimation. Consider the sum of the squared distances/errors, SSE, that the bivariate data points lie away from the line \(\widehat{Y}=\widehat{\beta}_0+\widehat{\beta}_1 X\).

We need to minimize SSE with respect to \(\widehat{\beta}_0\) and then with respect to \(\widehat{\beta}_1\), by finding the partial derivatives and setting them equal to zero: \(\frac{\partial SSE}{\partial\widehat{\beta }_i}=0\), where \(i=0,1\).

We will see:

The least-squares estimate of the \(Y\)-intercept of the model is:

\[\widehat{\beta}_0=\overline{Y}-\widehat{\beta}_1\overline{X}\]

The least-squares estimate of the slope of the model is:

\[\begin{align} \widehat{\beta}_1&=\frac{S_{XY}}{S_{XX}} =\frac{S_{XY}}{S_X S_X}=\frac{S_{XY}}{S_X^2} =\frac{r\,S_Y}{S_X} =\frac{\sum_{i=1}^n(X_i-\overline{X})(Y_i-\overline{Y}) }{\sum_{i=1}^n(X_i-\overline{X})^2 }\\ &=\frac{\sum_{i=1}^n X_i Y_i -n\overline{X}\,\overline{Y} }{\sum_{i=1}^n(X_i-\overline{X})^2 }=\frac{\sum_{i=1}^n X_i Y_i -n\overline{X}\,\overline{Y} }{\sum_{i=1}^n X_i^2-n \overline{X}^2 }\\ \end{align}\]
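These formulas can be checked directly in R; here they are applied to the ten-student height/weight data from Example 1 below and compared against `lm()`:

```r
X <- c(63,64,66,69,69,71,71,72,73,75)            # heights (predictor)
Y <- c(127,121,142,157,162,156,169,165,181,208)  # weights (response)

Sxy <- sum((X - mean(X)) * (Y - mean(Y)))
Sxx <- sum((X - mean(X))^2)

b1 <- Sxy / Sxx               # least-squares slope
b0 <- mean(Y) - b1 * mean(X)  # least-squares intercept

c(b0, b1)        # by-hand estimates
coef(lm(Y ~ X))  # agree with lm()
```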

\[\frac{\partial SSE}{\partial\widehat{\beta}_0}=-2\sum_{i=1}^n\left(Y_i-\widehat{\beta}_0-\widehat{\beta}_1 X_i\right)=0\]

\[\frac{\partial SSE}{\partial\widehat{\beta}_1}=-2\sum_{i=1}^n X_i\left(Y_i-\widehat{\beta}_0-\widehat{\beta}_1 X_i\right)=0\]
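Both partial-derivative conditions (the normal equations) can be verified numerically on any fitted model; a quick R sanity check on simulated data (the coefficients are made up, for illustration only):

```r
set.seed(2)
X <- runif(25, min = 0, max = 10)
Y <- 5 + 1.5 * X + rnorm(25)   # hypothetical true model plus noise

fit <- lm(Y ~ X)
e <- resid(fit)                # residuals e_i = Y_i - Yhat_i

sum(e)      # ~ 0: first normal equation
sum(X * e)  # ~ 0: second normal equation
```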


Example 1

Let’s look back at our student height and weight data, found in the data file in D2L. Use height as the predictor and weight as the response.

X <- c(63,64,66,69,69,71,71,72,73,75)
Y <- c(127,121,142,157,162,156,169,165,181,208)
Student <- 1:10

reg1 <- data.frame("Student ID" = Student, height = X, weight = Y)
reg1

cor(reg1$height,reg1$weight)
## [1] 0.9470984
cor(reg1$height,reg1$weight)^2
## [1] 0.8969953
plot(reg1$height,reg1$weight)


fit <- lm(weight ~ height, data = reg1)

abline(fit)  # add the fitted least-squares line to the scatterplot

fit
## 
## Call:
## lm(formula = weight ~ height, data = reg1)
## 
## Coefficients:
## (Intercept)       height  
##    -266.534        6.138
fit$coefficients[2]
##   height 
## 6.137581
predict(fit,data.frame(height=2.5*12))
##         1 
## -82.40695
predict(fit,data.frame(height=67))
##        1 
## 144.6836
157-predict(fit,data.frame(height=69))
##          1 
## 0.04127444


Example 2

How strong is the linear relationship between the age of a driver and the distance the driver can see? A research firm (Last Resource, Inc., Bellefonte, PA) collected data on a sample of \(n = 30\) drivers. What can you say about the relationship?

.csv file [right click → “download linked file” or “save link as”], then let’s import it into R.

Solution

Import the .csv file, then fit the model:

File <- read.csv(file.choose())  # interactively select the downloaded .csv
fit <- lm(Distance ~ Age, data = File)
summary(fit)
## 
## Call:
## lm(formula = Distance ~ Age, data = File)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.231 -41.710   7.646  33.552 108.831 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 576.6819    23.4709  24.570  < 2e-16 ***
## Age          -3.0068     0.4243  -7.086 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.76 on 28 degrees of freedom
## Multiple R-squared:  0.642,  Adjusted R-squared:  0.6292 
## F-statistic: 50.21 on 1 and 28 DF,  p-value: 1.041e-07
cov(File$Distance,File$Age)
## [1] -1425.862
cor(File$Distance,File$Age)
## [1] -0.8012447
plot(Distance~Age,data = File)
curve(fit$coefficients[1]+fit$coefficients[2]*x,add=T,col="red")
